Seeking the best Internet Model 
The models of the Internet reported in the literature are mainly aimed at reproducing the scale-free structure, the high clustering coefficient and the small-world effects found in the real Internet, while other important properties (e.g. related to centrality and hierarchical measurements) are not considered. For a better characterization and modeling of such a network, a larger number of topological properties must be considered. In this work, we present a sound multivariate statistical approach, including feature spaces and multivariate statistical analysis (especially canonical projections), in order to characterize several Internet models while considering a larger set of relevant measurements. We apply this methodology to determine, among nine complex network models, which are most compatible with the real Internet data (at the autonomous systems level), considering a set of 21 network measurements. We conclude that none of the considered models can reproduce the Internet topology with high accuracy.

I. INTRODUCTION

In the Internet, an autonomous system (AS) is a large domain of IP addresses that usually belongs to one organization, such as a university, a private company, or an Internet Service Provider. Since AS are connected through border routers, the Internet can be considered as consisting of interconnected AS. Understanding the fundamental mechanisms that govern the evolution and emergence of the Internet is essential for modeling and simulating dynamical processes, such as attacks [1] and cascade failures, as well as for trying to improve protocols and routing.

Large data sets about Internet connections have been available since the 1990s. In 1999, Faloutsos et al. [3] showed that the distribution of connections follows a power law, despite the fact that new vertices and edges appear and disappear all the time. This finding boosted the modeling and characterization of the Internet. Among the obtained results, it has been shown that the scale-free structure is important for providing network tolerance to random failures [l|| and traffic congestion [1, 0|. However, such a topology makes the network vulnerable to intentional attacks [f|. At the same time, the efficiency of Internet protocols is highly influenced by the network connectivity, while the power-law degree distribution results in the absence of an epidemic threshold, which favors the spreading of computer viruses.

The models proposed to generate the Internet topology vary from completely random to those including preferential attachment [8j]. Accurate models of the Internet are particularly important for growth forecasting, architecture planning and design, and for providing topologies for the simulation of dynamical processes. Although the characterization of the Internet structure is becoming more and more precise, just a few models can statistically reproduce, and even so only approximately, the Internet evolution 0| . While the current models are mainly aimed at the degree distribution, other important features, such as those quantified by centrality and hierarchical measurements, have not been considered in these models. This approach can result in inaccurate and incomplete models. For instance, Alderson et al. [13] showed that networks with the same number of vertices and edges, but distinct structure, can present the same degree distribution (see also [ll|]). In this way, the fact that a model reproduces the same degree distribution as the real network is not enough for validation. This suggests that most current Internet models can be biased, undermining endeavors such as the prediction of Internet evolution and dynamical simulations. In this paper, we apply an alternative approach to determine the accuracy of network models, by considering multivariate statistical analysis and Bayesian decision theory [H E Q •.



Multivariate statistical methods have not been considered by complex networks researchers until recently. The application of such methods to the classification of networks has been suggested recently (e.g. [H, 0, E3]). Multivariate statistical methods allow the consideration of a large set of variables and can be of great help for network modeling. Indeed, a model can be considered accurate if it can generate networks whose structural properties, quantified by a large set of network measurements, are statistically similar to those found for the real network being considered.



In this work we present the application of multivariate statistical methods, namely canonical projections and Bayesian decision theory, in order to determine which among a set of Internet models is the most appropriate to generate AS topologies. We considered nine different complex network models and a set of 21 measurements in our analysis.






II. CONCEPTS AND METHODS 






The considered Internet database, defined at the level of autonomous systems (AS), is available at the web site of the National Laboratory of Applied Network Research (http://www.nlanr.net). The data were collected in February 1998, with the network containing 3522 vertices and 6324 edges. For the network characterization, we took into account a set of 21 network measurements: (i) $\langle k \rangle$, average vertex degree; (ii) $k_{\max}$, maximum degree; (iii) $\langle cc \rangle$, average clustering coefficient; (iv) $k_{nn}$, average neighbor connectivity; (v) $\ell$, average shortest path length; (vi) $r$, assortative coefficient; (vii) $\langle B \rangle$, average betweenness; (viii) $c_D$, central point dominance; (ix) $st$, straightness coefficient of the degree distribution; (x) $\langle k_2 \rangle$, hierarchical degree of level two; (xi) $\langle cc_2 \rangle$, hierarchical clustering coefficient of level two; (xii) $cv_2$, convergence ratio of level two; (xiii) $dv_2$, divergence ratio of level two; (xiv) $E_2$, average inter-ring degree of level two; (xv) $A_2$, average intra-ring degree of level two; (xvi) $\langle k_3 \rangle$, hierarchical degree of level three; (xvii) $\langle cc_3 \rangle$, hierarchical clustering coefficient of level three; (xviii) $cv_3$, convergence ratio of level three; (xix) $dv_3$, divergence ratio of level three; (xx) $E_3$, average inter-ring degree of level three; and (xxi) $A_3$, average intra-ring degree of level three. The classification was obtained by considering canonical variable analysis and Bayesian decision theory [H E, El .



A. Network measurements 

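The AS network can be represented in terms of its adjacency matrix $A$, whose elements $a_{ij}$ are equal to one whenever there is a connection between the vertices $i$ and $j$, and equal to zero otherwise. The average vertex degree is given as

$$\langle k \rangle = \frac{1}{N} \sum_{i} k_i = \frac{1}{N} \sum_{i,j} a_{ij}, \qquad (1)$$

where $N$ is the number of vertices and $k_i$ is the degree of vertex $i$.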



The clustering coefficient of a node $i$ ($cc_i$) is defined as the number of links between the vertices within its neighborhood, divided by the number of links that could possibly exist between them ($k_i(k_i - 1)/2$). The average clustering coefficient is computed as

$$\langle cc \rangle = \frac{1}{N} \sum_{i} cc_i. \qquad (2)$$

The average neighbor connectivity ($k_{nn}$) measures the average degree of the neighbors of each vertex in the network [18]. The average shortest path length ($\ell$) is calculated by taking into account the shortest distance between each pair of vertices in the network. The assortative coefficient measures the correlation between vertex degrees, i.e.,

$$r = \frac{\frac{1}{M} \sum_{j>i} k_i k_j a_{ij} - \left[ \frac{1}{M} \sum_{j>i} \frac{1}{2} (k_i + k_j) a_{ij} \right]^2}{\frac{1}{M} \sum_{j>i} \frac{1}{2} (k_i^2 + k_j^2) a_{ij} - \left[ \frac{1}{M} \sum_{j>i} \frac{1}{2} (k_i + k_j) a_{ij} \right]^2}, \qquad (3)$$

where $M$ is the total number of edges.

The straightness coefficient ($st$) quantifies the degree to which a log-log distribution of points approaches a power law; it is computed as the Pearson correlation coefficient of the log-log degree distribution [lfjj ].
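
As an illustration of the non-hierarchical measurements above, the following minimal sketch computes the average degree, the average clustering coefficient and the assortative coefficient. It assumes the network is stored as a dictionary mapping each vertex to the set of its neighbours; this representation and all function names are choices made here for illustration, not the authors' implementation. The assortative coefficient is evaluated directly as the Pearson correlation between the degrees at the two ends of every edge.

```python
def degrees(adj):
    """Degree of every vertex; adj maps a vertex to the set of its neighbours."""
    return {v: len(nb) for v, nb in adj.items()}


def average_degree(adj):
    """Mean of the vertex degrees."""
    k = degrees(adj)
    return sum(k.values()) / len(k)


def clustering(adj, i):
    """Links among the neighbours of i divided by k_i (k_i - 1) / 2."""
    nb = list(adj[i])
    k = len(nb)
    if k < 2:
        return 0.0  # assumption: vertices with fewer than two neighbours count as zero
    links = sum(1 for a in range(k) for b in range(a + 1, k) if nb[b] in adj[nb[a]])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficients."""
    return sum(clustering(adj, i) for i in adj) / len(adj)


def assortativity(adj):
    """Pearson correlation between the degrees at the two ends of each edge."""
    k = degrees(adj)
    # Every undirected edge appears once in each direction, which keeps the measure symmetric.
    pairs = [(k[i], k[j]) for i in adj for j in adj[i]]
    n = len(pairs)
    mean = sum(x for x, _ in pairs) / n
    cov = sum((x - mean) * (y - mean) for x, y in pairs) / n
    var = sum((x - mean) ** 2 for x, _ in pairs) / n
    return cov / var if var > 0 else float("nan")
```

For a star graph, where a single hub is linked to vertices of degree one, this sketch returns an assortative coefficient of -1, the expected value for a perfectly disassortative network.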

The considered centrality measurements are based on the betweenness centrality, which is defined as

$$B_u = \sum_{i,j} \frac{\sigma(i,u,j)}{\sigma(i,j)}, \qquad (4)$$

where $\sigma(i,u,j)$ is the number of shortest paths between vertices $i$ and $j$ that pass through vertex $u$, $\sigma(i,j)$ is the total number of shortest paths between $i$ and $j$, and the sum is over all pairs $i, j$ of distinct vertices. The average betweenness centrality ($\langle B \rangle$) is computed considering the whole set of vertices in the network. The central point dominance is defined in terms of the betweenness by the following equation,

$$c_D = \frac{1}{N-1} \sum_{i} \left( B_{\max} - B_i \right), \qquad (5)$$

where $B_{\max}$ represents the maximum betweenness found in the network.
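
The betweenness and the central point dominance can be sketched as follows. The code uses Brandes' algorithm for unweighted graphs, which is a standard technique chosen here and not necessarily the one used by the authors. Two assumptions are made: the sum runs over ordered pairs $(i, j)$, and the betweenness is divided by $(N-1)(N-2)$ before the dominance is computed, so that a star graph yields $c_D = 1$. The text does not state which normalization was adopted.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm; adj maps a vertex to the set of its neighbours."""
    B = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies, visiting vertices from the farthest to the nearest.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                B[w] += delta[w]
    return B


def central_point_dominance(adj):
    """Mean gap between the largest betweenness and every other value (needs N >= 3)."""
    n = len(adj)
    norm = (n - 1) * (n - 2)  # assumption: rescale the betweenness to the range [0, 1]
    b = [x / norm for x in betweenness(adj).values()]
    return sum(max(b) - x for x in b) / (n - 1)
```

With this normalization the dominance is 0 for a ring, where all vertices are equivalent, and 1 for a star, where every shortest path crosses the hub.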

Complex network measurements can also be defined in a hierarchical (or concentric) way [3, 0, H3, HH , i.e. by considering the successive neighborhoods around each node. Therefore, it is interesting to define the ring of vertices $R_d(i)$, which is formed by those vertices distant $d$ edges from the reference vertex $i$. The hierarchical degree at distance $d$ ($k_d(i)$) is defined as the number of edges connecting the rings $R_d(i)$ and $R_{d+1}(i)$. The hierarchical clustering coefficient is given by the number of edges in the respective $d$-ring ($m_d(i)$), divided by the total number of possible edges between the vertices in that ring,

$$cc_d(i) = \frac{2 m_d(i)}{n_d(i) \left( n_d(i) - 1 \right)}, \qquad (6)$$



where $n_d(i)$ represents the number of vertices in the ring $R_d(i)$. The convergence ratio at distance $d$ of $i$ corresponds to the ratio between the hierarchical degree at distance $d - 1$ and the number of vertices in the ring $R_d(i)$,

$$cv_d(i) = \frac{k_{d-1}(i)}{n_d(i)}. \qquad (7)$$



The divergence ratio corresponds to the reciprocal of the convergence ratio, i.e.,

$$dv_d(i) = \frac{n_d(i)}{k_{d-1}(i)}. \qquad (8)$$

Finally, the average inter-ring degree is given by the average of the number of connections between each vertex in the ring $R_d(i)$ and those in $R_{d+1}(i)$,

$$E_d(i) = \frac{k_d(i)}{n_d(i)}, \qquad (9)$$



and the average intra-ring degree is defined as the average among the degrees of the vertices in the ring $R_d(i)$,

$$A_d(i) = \frac{2 m_d(i)}{n_d(i)}. \qquad (10)$$



The average of each hierarchical measurement is obtained by taking into account the local hierarchical measurement of each vertex in the network.
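
The hierarchical measurements of Eqs. (6)-(10) can be sketched with a breadth-first search that groups vertices into rings. The adjacency representation (a dictionary of neighbour sets), the function names and the convention of returning 0.0 whenever a denominator vanishes are assumptions made here for illustration only.

```python
from collections import deque


def rings(adj, i):
    """Return [R_0(i), R_1(i), ...]; R_d(i) holds the vertices at distance d from i."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    out = [set() for _ in range(max(dist.values()) + 1)]
    for v, d in dist.items():
        out[d].add(v)
    return out


def links(adj, a, b):
    """Number of ordered pairs (v, w), v in a and w in b, joined by an edge."""
    return sum(1 for v in a for w in adj[v] if w in b)


def hierarchical(adj, i, d):
    """Hierarchical measurements of vertex i at distance d >= 1."""
    R = rings(adj, i)

    def ring(q):
        return R[q] if 0 <= q < len(R) else set()

    n_d = len(ring(d))
    m_d = links(adj, ring(d), ring(d)) // 2  # every intra-ring edge is seen twice
    k_d = links(adj, ring(d), ring(d + 1))   # hierarchical degree at distance d
    k_prev = links(adj, ring(d - 1), ring(d))  # hierarchical degree at distance d - 1
    return {
        "k": k_d,
        "cc": 2.0 * m_d / (n_d * (n_d - 1)) if n_d > 1 else 0.0,
        "cv": k_prev / n_d if n_d else 0.0,
        "dv": n_d / k_prev if k_prev else 0.0,
        "E": k_d / n_d if n_d else 0.0,
        "A": 2.0 * m_d / n_d if n_d else 0.0,
    }
```

Averaging each returned entry over all reference vertices gives the network-level quantities used in the feature vector.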



B. Network models 

The following nine complex network types are considered for modeling the Internet:

1. Erdős-Rényi random graph (ER): The network is constructed by connecting each pair of vertices in the network with a fixed probability $p$ [13], where each pair of vertices is selected at random only once. This model generates a Poisson degree distribution.

2. Small-world model of Watts and Strogatz (WS): To construct this small-world network, one starts with a regular lattice of $N$ vertices in which each vertex is connected to its $k$ nearest neighbors in each direction. Each edge is then randomly rewired with probability $p$ [H].

3. Waxman geographical Internet model (WGM): Geographical networks can be constructed by distributing $N$ vertices at random in a 2D space and connecting them according to their distance. The model suggested by Waxman to model the Internet topology (HI considers the probability to connect two vertices $i$ and $j$, separated by a distance $D_{ij}$, as $P(i \to j) \sim$

4. Barabási-Albert scale-free model (BA): The network is generated by starting with a set of $m$ vertices and, at each time step, the network grows with the addition of a new vertex with $m$ links. The vertices which receive the new edges are chosen following a linear preferential attachment rule, i.e. the probability of the new vertex $i$ to connect with an existing vertex $j$ is proportional to the degree of $j$, $P(i \to j) = k_j / \sum_u k_u$.

5. Limited scale-free model (LSF): The network is generated as in the BA model, but the maximum degree is limited to be equal to the maximum degree of the real network [26].



6. Scale-free model of Dorogovtsev, Mendes and Samukhin (DMS): This network is constructed as in the BA model, but the preferential attachment rule is defined as $P(i \to j) = (k_j + k_0) / \sum_u (k_u + k_0)$ [27]. The constant $k_0$ controls the initial attractiveness and provides variation of connectivity from $-m < k_0 < \infty$, allowing a larger variation in the exponent of the power law, $\gamma = 3 + k_0/m$ (for the BA model, $\gamma = 3$).

7. Nonlinear scale-free network model (NLSF): The network is constructed as in the BA model, but instead of a linear preferential attachment rule, the vertices are connected following a nonlinear preferential attachment rule, i.e., $P_{i \to j} = k_j^{\alpha} / \sum_u k_u^{\alpha}$. In this case, for $\alpha < 1$ the network has a stretched exponential degree distribution, while for $\alpha > 1$ a single site connects to nearly all other sites [28].

8. The geographic directed preferential Internet topology model (GdTang): This Internet generator constructs directed AS networks by considering some rules of the BA model. At each time step, a new vertex $i$ and $m$ edges are added to the network. The new vertex $i$ connects with a vertex $j$ according to the rule $P_{i \to j} = k_j^{out} / \sum_u k_u^{out}$. The remaining $m - 1$ edges connect any vertices in the network according to the rule: the outgoing endpoint of each edge (node $i$) is chosen with probability Pi = kj 1 / J2 U and the incoming endpoint (node $j$) with P 3 = fcf. With probability $\beta$, the added edge is local and the endpoints are restricted to the same region. The nodes are spatially distributed considering a pre-defined distribution. On the other hand, with probability $1 - \beta$, the edge is global and can connect any endpoints. With probability $p$, each added edge may become an undirected edge [29].

9. The Inet Internet topology generator (Inet): Inet 3.0 is based on the analysis of AS growth since November 1997. Basically, this model assumes an exponential growth rate of the number of AS and computes the number of months $t$ necessary to obtain a network with $N$ vertices. Next, the out-degree frequency and the rank out-degree distribution are calculated. A fraction of the vertices is assigned degree one and the remaining vertices are assigned out-degrees according to the out-degree frequency. More details about this model can be found in [il. [3oT|.

The models (iv)-(ix) produce networks with power-law degree distributions, as observed in the Internet. The models (i)-(iii) are considered in the current network classification because of their ability to reproduce network topological properties such as the small-world effect and high average clustering coefficient values. The NLSF model is simulated considering the exponents of the preferential attachment equal to $\alpha = 0.5$ and $\alpha = 1.5$. The models WGM, GdTang and Inet were developed specifically to generate Internet topologies. Although GdTang generates directed networks, we symmetrize the connections, i.e., directed connections were transformed into undirected ones. This transformation does not alter the network structure. All considered networks were formed by $N = 3522$ vertices, and the average vertex degree was adjusted to that of the original network ($\langle k_{AS} \rangle = 3.59$).
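
The BA, DMS and NLSF models differ only in the attachment weight assigned to an existing vertex of degree $k$: $k$, $k + k_0$ and $k^{\alpha}$, respectively. The sketch below grows a network with an arbitrary weight function. It is an illustration under stated assumptions rather than the generators used in this work: the seed graph is taken as a complete graph on $m + 1$ vertices (so that every degree is positive), multiple edges are avoided by redrawing repeated targets, and the function and parameter names are hypothetical.

```python
import random


def grow_network(n, m, weight=lambda k: k, seed=None):
    """Grow a network to n vertices; each new vertex attaches to m distinct targets.

    weight(k) is the (positive) attachment weight of a vertex of degree k.
    """
    rng = random.Random(seed)
    # Assumption: start from a complete graph on m + 1 vertices.
    adj = {v: {u for u in range(m + 1) if u != v} for v in range(m + 1)}
    for new in range(m + 1, n):
        old = list(adj)
        w = [weight(len(adj[v])) for v in old]
        targets = set()
        while len(targets) < m:  # redraw until m distinct targets are found
            targets.add(rng.choices(old, weights=w, k=1)[0])
        adj[new] = set()
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
    return adj


ba = grow_network(200, 2, seed=1)                                # linear rule (BA)
nlsf = grow_network(200, 2, weight=lambda k: k ** 1.5, seed=1)   # nonlinear rule
dms = grow_network(200, 2, weight=lambda k: k - 1.6, seed=1)     # k + k_0 with k_0 = -1.6
```

Recomputing all weights at every step costs $O(N^2)$ overall, which is acceptable for a sketch but would be replaced by an incremental sampling scheme for networks of the size considered here.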

C. Classification methodology 

A multivariate statistical method was adopted in order to associate (through classification) the Internet to the most likely among the considered models [lj|. The classification was obtained by associating the real network to the model which best reproduces its topology, as quantified by the measurements. The feature space was defined for 10 classes (the nonlinear model is defined considering two different exponents for the preferential attachment). For each model, 50 networks were generated and 21 measurements were computed. In this way, each network model realization was represented by a feature vector composed of 21 elements in the space of attributes. Such a space was projected into 2D by using canonical variable analysis [la . l3l| and the regions of classification were obtained by Bayesian decision theory [13, [H.

Canonical analysis has been used to reduce the dimensionality of the measurement feature space. It provides a powerful extension of principal component analysis [31], performing projections which optimize the separation between known categories of objects. To perform the canonical analysis it is necessary to construct a matrix which quantifies the variation inside the previously defined groups, and a second matrix which quantifies the variation among these groups. If we consider $C$ classes (network models), each one identified as $C_i$, $i = 1, \ldots, C$, and that each network realization $n$ is represented by its respective feature vector $\vec{x}_n = (x_1, x_2, \ldots, x_p)^T$, the intraclass scatter matrix is defined as

$$S_{intra} = \sum_{i=1}^{C} \sum_{n \in C_i} \left( \vec{x}_n - \langle \vec{x} \rangle_i \right) \left( \vec{x}_n - \langle \vec{x} \rangle_i \right)^T, \qquad (11)$$

and the interclass scatter matrix is given as

$$S_{inter} = \sum_{i=1}^{C} N_i \left( \langle \vec{x} \rangle_i - \langle \vec{x} \rangle \right) \left( \langle \vec{x} \rangle_i - \langle \vec{x} \rangle \right)^T, \qquad (12)$$

where $N_i$ is the number of network realizations in class $i$, $\langle \vec{x} \rangle_i$ corresponds to the average of each variable for the class $i$, and $\langle \vec{x} \rangle$ is the general average of each variable for all classes.

By computing the eigenvectors of the matrix $S_{intra}^{-1} S_{inter}$ and selecting those corresponding to the highest absolute eigenvalues, $\lambda_1, \ldots, \lambda_p$, it is possible to project the set of variables into fewer dimensions, usually two or three, depending on the number of highest eigenvalues considered [13].
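
The projection step can be written out explicitly as follows. This is a sketch in notation introduced here for illustration (the eigenvectors $\vec{v}_q$ and the projected coordinates $\vec{y}_n$ do not appear in the text), and it assumes that the intraclass scatter matrix is invertible.

```latex
% Eigenproblem defining the canonical directions, ordered by eigenvalue magnitude
S_{intra}^{-1} S_{inter} \, \vec{v}_q = \lambda_q \vec{v}_q ,
\qquad |\lambda_1| \ge |\lambda_2| \ge \dots

% Two-dimensional canonical projection of the feature vector of realization n
\vec{y}_n = \left( \vec{v}_1 \cdot \vec{x}_n , \; \vec{v}_2 \cdot \vec{x}_n \right)^T

% Each direction is a stationary point of the ratio of interclass to intraclass scatter
J(\vec{v}) = \frac{\vec{v}^T S_{inter} \, \vec{v}}{\vec{v}^T S_{intra} \, \vec{v}}
```

Read in this way, the first canonical variable is the direction along which the class averages are most spread out relative to the dispersion inside each class, which is why the projection separates the models better than a principal component projection of the same data.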

The Bayesian decision is performed in order to obtain the regions of classification by considering non-parametric estimation. In this case, the mass probabilities $P_i = P(C_i)$, which correspond to the probability that a network belongs to class $C_i$, as well as the conditional probability densities $f(\vec{x}_n \mid C_i)$, are estimated by using non-parametric methods (see [12, 13]). The Bayes rule can then be expressed as:

if $f(\vec{x}_n \mid C_a) P(C_a) = \max_{b = 1, \ldots, C} \{ f(\vec{x}_n \mid C_b) P(C_b) \}$

then select $C_a$,

where $\vec{x}_n$ is the vector that stores the network set of measurements and $C_a$ is the class of networks associated to the model $a$. Further details about such an approach are discussed in [13].
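
A minimal sketch of this decision rule in the projected 2D space is given below. The text only states that non-parametric methods are used, so the isotropic Gaussian kernel, the bandwidth `h`, the estimation of the mass probabilities from relative class sizes and all names are assumptions made here for illustration.

```python
import math


def gaussian_kde(points, h):
    """Return a 2D density estimate built from the given points with bandwidth h."""
    norm = 1.0 / (len(points) * 2.0 * math.pi * h * h)

    def density(x):
        return norm * sum(
            math.exp(-((x[0] - p[0]) ** 2 + (x[1] - p[1]) ** 2) / (2.0 * h * h))
            for p in points
        )

    return density


def bayes_classify(x, classes, h=1.0):
    """Select the class C_a that maximizes f(x | C_a) P(C_a).

    classes maps a model label to the list of projected points of its realizations.
    """
    total = sum(len(pts) for pts in classes.values())
    best_label, best_score = None, -1.0
    for label, pts in classes.items():
        prior = len(pts) / total  # assumption: mass probability from relative frequency
        score = prior * gaussian_kde(pts, h)(x)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical projected points for two models, used only to exercise the rule.
example = {
    "model A": [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)],
    "model B": [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1)],
}
label = bayes_classify((0.2, 0.2), example)
```

Evaluating `bayes_classify` on a grid of points yields the decision regions, and applying it to the projected feature vector of the real network gives its classification.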

III. RESULTS AND DISCUSSION 

The network models were generated while considering parameters that best approximate the average vertex degree and/or the average clustering coefficient of the real network. In this way, we considered for each model: (i) ER, $p = \langle k_{AS} \rangle / (N - 1)$; (ii) WS, $k \simeq \langle k_{AS} \rangle / 2 = 2$ and $p = 1 - [\langle cc_{AS} \rangle (4k - 2)/(3k - 3)]^{1/3}$; (iii) BA, $m \simeq \langle k_{AS} \rangle / 2 = 2$; (iv) WGM, the parameters $\lambda = 1.35$ and $\theta = 1$ were adjusted in order to obtain an average degree similar to that of the real network; (v) LSF, $m \simeq \langle k_{AS} \rangle / 2 = 2$ and the maximum degree was taken equal to that observed in the real network; (vi) DMS, $m \simeq \langle k_{AS} \rangle / 2 = 2$ and $k_0 = m(\gamma_{AS} - 3)$, where $\gamma_{AS} = 2.2$ is the exponent of the degree distribution of the Internet [HI ; (vii) NLSF, $m \simeq \langle k_{AS} \rangle / 2 = 2$ and the exponent of the nonlinearity was taken as $\alpha = 0.5$ and $\alpha = 1.5$; (viii) GdTang, $p = 0.5$ and $\beta = 0.07$; and (ix) Inet 3.0, the fraction of vertices with degree equal to one was defined as observed in the Internet. The measurements $\langle k_{AS} \rangle$ and $\langle cc_{AS} \rangle$ are the average degree and the average clustering coefficient found in the Internet, respectively. For each model, 50 networks were generated and a set of 21 different measurements was computed for each one (nine non-hierarchical and six hierarchical, the latter evaluated at the second and third hierarchical levels).
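
The parameter choices above reduce to a few lines of arithmetic, sketched below. The vertex and edge counts, the exponent 2.2 and the formulas come from the text; the clustering value passed to the WS formula is a hypothetical placeholder, since the measured $\langle cc_{AS} \rangle$ is not quoted in this section, and the function names are introduced here for illustration.

```python
def er_probability(k_avg, n):
    """Connection probability giving an expected mean degree k_avg in an ER graph."""
    return k_avg / (n - 1)


def ws_rewiring_probability(cc, k):
    """Rewiring probability p = 1 - [cc (4k - 2) / (3k - 3)]^(1/3); requires k > 1."""
    return 1.0 - (cc * (4 * k - 2) / (3 * k - 3)) ** (1.0 / 3.0)


def dms_initial_attractiveness(m, gamma):
    """k_0 = m (gamma - 3), obtained by inverting gamma = 3 + k_0 / m."""
    return m * (gamma - 3.0)


N, EDGES = 3522, 6324
k_as = 2 * EDGES / N                      # average degree of the AS network
p_er = er_probability(k_as, N)
k0 = dms_initial_attractiveness(2, 2.2)   # negative, but larger than -m

CC_EXAMPLE = 0.2                          # hypothetical clustering value, not from the paper
p_ws = ws_rewiring_probability(CC_EXAMPLE, 2)
```

The WS expression inverts the approximation $\langle cc \rangle \approx \frac{3(k-1)}{2(2k-1)} (1 - p)^3$ for the clustering of a rewired lattice, so it only yields a valid probability when the target clustering does not exceed the clustering of the unrewired lattice.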

Table I presents the five most commonly used measurements for network characterization. According to their values, we may conclude that Inet 3.0 is the most accurate model, in spite of $\langle cc \rangle = 0$. However, such a set of measurements does not quantify the majority of network properties, and a larger set of measurements must be considered in order to enhance the precision of the analysis.

In order to obtain the classification of the Internet by using canonical variable analysis and Bayesian decision theory, according to the set of models and measurements, we took into account the following eight measurement configurations:






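1. $\{k_{\max}, \ell, r\}$

2. $\{\langle k \rangle, k_{\max}, \langle cc \rangle, \ell, r, c_D\}$

3. $\{\langle cc \rangle, k_{nn}, \ell, c_D, st\}$

4. $\{k_{\max}, \langle cc \rangle, k_{nn}, \ell, r, \langle B \rangle, st\}$

5. $\{k_{nn}, \ell, r, \langle B \rangle, \langle k_2 \rangle, \langle cc_2 \rangle, \langle k_3 \rangle, \langle cc_3 \rangle\}$

6. $\{\langle k \rangle, k_{\max}, \langle cc \rangle, \ell, r, c_D, \langle k_2 \rangle, \langle cc_2 \rangle\}$

7. $\{\langle k_2 \rangle, \langle cc_2 \rangle, \langle cv_2 \rangle, \langle E_2 \rangle, \langle A_2 \rangle, \langle k_3 \rangle, \langle cc_3 \rangle, \langle cv_3 \rangle, \langle E_3 \rangle, \langle A_3 \rangle\}$

8. $\{\langle k \rangle, k_{\max}, \langle cc \rangle, k_{nn}, \ell, r, \langle B \rangle, c_D, st, \langle k_2 \rangle, \langle cc_2 \rangle, cv_2, E_2, A_2, \langle k_3 \rangle, \langle cc_3 \rangle, cv_3, E_3, A_3\}$.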

Figures 1 and 2 present the obtained partitions and classifications. As we can see, different classifications were obtained depending on the set of measurements considered. For the sets (i) and (ii) (Figures 1(a) and 1(b)), the Internet was best represented by the model Inet 3.0. Indeed, this result is also observed in Table I and reflects the biased classification obtained when a reduced set of measurements is considered. The Inet reproduces well some topological measurements ($\langle k \rangle$, $k_{\max}$, $\ell$, $r$), while other measurements ($\langle cc \rangle$ and $c_D$) tend to diverge. When the sets (vi) and (vii) are taken into account, the Internet is best modeled by the ER network model (Figures 2(b) and 2(c)). This classification was not expected, since the ER model produces networks with topology different from the Internet (see Table I). When the measurements (iii), (iv) and (viii) are considered, the Internet is classified as NLSF ($\alpha = 1.5$) (Figures 1(c), 1(d) and 2(d)). Indeed, this model considers the nonlinear preferential attachment, which has been considered in other Internet models, such as that developed by Zhou and Mondragon [12], which was not considered here because it is suitable to reproduce only CAIDA networks [33]. For the set of measurements (v), the Internet was classified as the BA model, even though the BA model did not produce assortative networks with high average clustering coefficient and degree distribution with the same exponent as observed in the Internet ($\gamma_{BA} = 3$ and $\gamma_{AS} = 2.2$). In none of the classifications was the real network placed among the points that defined each class. All these results suggest that none of the models can reproduce the Internet topology with high accuracy. The ER, BA, NLSF ($\alpha = 1.5$) and Inet 3.0 models can reproduce just some topological properties of the real network. Therefore, such models can be considered only rough approximations. For a given model to reproduce the Internet structure with precision, the real network would have to be classified as corresponding to this model whatever the set of measurements considered. Our results suggest that a revision of Internet modeling must be considered in order to obtain improved prototypes. A possibility to obtain a better model of the Internet is to observe which of the properties of the ER, BA, NLSF and Inet 3.0 models are important for Internet evolution. In this way, a hybrid model may be constructed.
IV. CONCLUSIONS 

In this work we presented an application of multivariate statistical analysis to determine, among a set of pre-defined complex network models, which of them is potentially most suitable to represent the Internet topology. Our results suggest that none of the considered models reproduces all considered features of the Internet. Even models developed specifically to reproduce the Internet structure, such as Inet, WGM and GdTang, do not seem to be very accurate. In order to obtain more precise modeling, hybrid models can be constructed, considering properties of the ER, BA, NLSF and Inet 3.0 models that are important for Internet evolution, as these models were the only ones that reproduced, partially, some Internet topological properties.

The present work suggests that a revision of Internet modeling is needed, which can be assisted by the methods considered in this work. Also, it is possible to extend our approach by considering the contribution of each measurement to the separation in the phase space as a systematic methodology for identifying the incompleteness of the models. This approach can result in incremental improvements, making it possible to quantify the importance of each measurement in the discrimination. The extension of the modeling methods to other types of complex networks, such as social and biological ones, is straightforward.

FIG. 1: Classification obtained considering different sets of measurements. The network realizations are represented by dots, corresponding to the following models: + ER, × WS, BA, ♦ WGM, LSF, △ DMS, ▽ NLSF ($\alpha = 0.5$), □ NLSF ($\alpha = 1.5$), o GdTang and * Inet 3.0. The real network is represented by <. The sets of measurements in each case are (a) $\{k_{\max}, \ell, r\}$, (b) $\{\langle k \rangle, k_{\max}, \langle cc \rangle, \ell, r, c_D\}$, (c) $\{\langle cc \rangle, k_{nn}, \ell, c_D, st\}$ and (d) $\{k_{\max}, \langle cc \rangle, k_{nn}, \ell, r, \langle B \rangle, st\}$.
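
FIG. 2: Classification obtained considering different sets of measurements. The network realizations are represented by dots, corresponding to the following models: + ER, × WS, BA, ♦ WGM, LSF, △ DMS, ▽ NLSF ($\alpha = 0.5$), □ NLSF ($\alpha = 1.5$), o GdTang and * Inet 3.0. The real network is represented by <. The sets of measurements in each case are (a) $\{k_{nn}, \ell, r, \langle B \rangle, \langle k_2 \rangle, \langle cc_2 \rangle, \langle k_3 \rangle, \langle cc_3 \rangle\}$, (b) $\{\langle k \rangle, k_{\max}, \langle cc \rangle, \ell, r, c_D, \langle k_2 \rangle, \langle cc_2 \rangle\}$, (c) $\{\langle k_2 \rangle, \langle cc_2 \rangle, \langle cv_2 \rangle, \langle E_2 \rangle, \langle A_2 \rangle, \langle k_3 \rangle, \langle cc_3 \rangle, \langle cv_3 \rangle, \langle E_3 \rangle, \langle A_3 \rangle\}$ and (d) $\{\langle k \rangle, k_{\max}, \langle cc \rangle, k_{nn}, \ell, r, \langle B \rangle, c_D, st, \langle k_2 \rangle, \langle cc_2 \rangle, cv_2, E_2, A_2, \langle k_3 \rangle, \langle cc_3 \rangle, cv_3, E_3, A_3\}$.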



